We live in the Internet age: every day we are flooded with LINE, Email, Facebook, Instagram and all sorts of other messages and web pages. Even a whole day of reading would not be enough to digest them, so we leave messages on "read", mute whole groups, or simply delete things, and in the process we may miss something important. Faced with this "Information Overload", it would be nice to have a program that acts like a little secretary: it reads everything first and condenses it into summaries, so we only need to skim the summaries and follow a hyperlink into the full text when something catches our interest. That way we are much less likely to miss important information. So this time the topic is Automatic Text Summarization. Since the literal Chinese translation is hard to parse, I took the liberty of titling it 『自動擷取摘要』; to avoid debate, I will simply use the English term below.
I remember a news story from a few years ago about a 15-year-old in the UK who wrote a program that automatically summarized the daily news from the major papers, and it attracted quite a few subscribers. Let's see how this can be done, in particular with a Neural Network approach.
『A Gentle Introduction to Text Summarization』 lists many uses of Automatic Text Summarization:
At the same time, Automatic Text Summarization also helps the development of Question-Answering systems: only by grasping the gist of a question can a system give an appropriate answer. In general, there are two approaches:
1. The Extractive Method (萃取法): pick the most important sentences out of the source text and use them as the summary.
2. The Abstractive Method (抽象法): understand the text and regenerate a summary in new wording.
The former is simpler and easier to get working; the latter is harder, but closer to the way humans think. Since Neural Networks aim to mimic humans, most neural approaches take the latter route. Facebook, IBM and Google have all published papers on it, so with the three giants on board, this topic clearly counts as a hot one.
The most common approach today is similar to the one in [Day 18: 機器翻譯 (Machine Translation)](https://ithelp.ithome.com.tw/articles/10194403): an RNN seq2seq algorithm in Encoder-Decoder form, where the input (the question, or the article to be summarized) is fed to the encoder and the output (the answer, or the summary) is predicted by the decoder, as sketched below.
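To make the Encoder-Decoder idea concrete, here is a minimal, illustrative Keras sketch of a seq2seq training graph. The vocabulary size, hidden-state size and one-hot input representation are placeholder assumptions of mine; this is not the architecture from the papers mentioned above, only the basic encoder-state-to-decoder wiring.

from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size = 5000   # assumed vocabulary size (tokens fed as one-hot vectors)
latent_dim = 128    # assumed size of the RNN hidden state

# Encoder: read the source sequence and keep only its final hidden/cell states
encoder_inputs = Input(shape=(None, vocab_size))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: generate the target sequence, initialised with the encoder states
decoder_inputs = Input(shape=(None, vocab_size))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(vocab_size, activation='softmax')(decoder_outputs)

# Training graph: source sequence + shifted target sequence in,
# next-token probabilities out
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.summary()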
Let's first look at the steps of the Extractive Method; for details, see 『Text Summarization Techniques: A Brief Survey』. Broadly, an extractive system builds an intermediate representation of the text, scores each sentence against it, and then selects the highest-scoring sentences as the summary.
Let's look at a sample program. It comes from Chapter 5 of the book 『NLTK Essentials』; it uses the NLTK library and does not involve Neural Networks. Some of the techniques involved were covered in the previous post.
import nltk
# The article: a short news piece about Barack Obama stepping down as president
news_content='''At noon on Friday, 55-year old Barack Obama became a federal retiree.
His pension payment will be $207,800 for the upcoming year, about half of his presidential salary.
Obama and every other former president also get seven months of "transition" services to help adjust to post-presidential life. The ex-Commander in Chief also gets lifetime Secret Service protection as well as allowances for things such as travel, office expenses, communications and health care coverage.
All those extra expenses can really add up. In 2015 they ranged from a bit over $200,000 for Jimmy Carter to $800,000 for George W. Bush, according to a government report. Carter doesn't get health insurance because you have to work for the federal government for five years to qualify.
'''
# Tokenize, POS-tag, run NER, score each sentence; the sentences are ranked by score later
results=[]
for sent_no,sentence in enumerate(nltk.sent_tokenize(news_content)):
    no_of_tokens=len(nltk.word_tokenize(sentence))
    # Let's do POS tagging
    tagged=nltk.pos_tag(nltk.word_tokenize(sentence))
    # Count the no of Nouns in the sentence
    no_of_nouns=len([word for word,pos in tagged if pos in ["NN","NNP"] ])
    # Use NER to tag the named entities.
    ners=nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)), binary=False)
    no_of_ners= len([chunk for chunk in ners if hasattr(chunk, 'label')])
    score=(no_of_ners+no_of_nouns)/float(no_of_tokens)
    results.append((sent_no,no_of_tokens,no_of_ners, no_of_nouns,score,sentence))
# Print the sentences in order of importance (highest score first)
for sent in sorted(results,key=lambda x: x[4],reverse=True):
    print(sent[5])
The program can be downloaded from 『Packt』; the file is summarizer.py. The original had a few typos, so I fixed them and added a few comments; my version can be downloaded here.
Run the following command in a DOS (command-prompt) window:
python summarizer.py
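One practical note, not part of the original book example: the script depends on a few NLTK data packages for tokenization, POS tagging and NER. If they are missing, NLTK raises a LookupError; they can be installed once like this:

import nltk
nltk.download('punkt')                        # sentence / word tokenizer models
nltk.download('averaged_perceptron_tagger')   # POS tagger
nltk.download('maxent_ne_chunker')            # NER chunker
nltk.download('words')                        # word list used by the NER chunker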
The program is short, roughly ten lines. The processing flow is as follows:
1. Split the article into sentences (nltk.sent_tokenize) and each sentence into words (nltk.word_tokenize).
2. POS-tag each sentence and count the nouns (NN, NNP); run NER (nltk.ne_chunk) and count the named entities.
3. Score each sentence as (named entities + nouns) / number of tokens.
4. Sort the sentences by score, from highest to lowest, and print them.
The sentences, output in order of importance, are:
At noon on Friday, 55-year old Barack Obama became a federal retiree.
The ex-Commander in Chief also gets lifetime Secret Service protection as well as allowances for things such as travel, office expenses, communications and health care coverage.
In 2015 they ranged from a bit over $200,000 for Jimmy Carter to $800,000 for George W. Bush, according to a government report.
His pension payment will be $207,800 for the upcoming year, about half of his presidential salary.
Carter doesn't get health insurance because you have to work for the federal government for five years to qualify.
Obama and every other former president also get seven months of "transition" services to help adjust to post-presidential life.
All those extra expenses can really add up.
The lead sentence is ranked first, and at first glance the ordering looks fairly reasonable: the sentences that mention dollar amounts all land near the top (I guess I just like money XD). The simplest way to produce an automatic summary is to take the first few ranked sentences, which is exactly a simplified Extractive Method; a small sketch of that last step follows below.
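As a minimal sketch of that step, reusing the results list built by the script above (the value of top_n is an arbitrary choice of mine):

# Keep the top-scoring sentences, then restore their original order
# so the summary still reads in the same sequence as the source article
top_n = 3
top_sentences = sorted(results, key=lambda x: x[4], reverse=True)[:top_n]
summary = ' '.join(s[5] for s in sorted(top_sentences, key=lambda x: x[0]))
print(summary)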
Next time, we will look at the Abstractive Method, i.e., the Neural Network approach.